Abstract

A time series model based on the global structure of the complete genome is
proposed. Three kinds of length sequences of the complete genome are considered,
and their correlation dimensions and Hurst exponents are calculated. Using these
two exponents, some interesting results on the classification and evolutionary
relationships of bacteria are obtained.
1 Introduction

The nucleotide sequences stored in GenBank have exceeded hundreds of millions of bases,
and they increase tenfold every five years. A great deal of information concerning the
origin of life, the evolution of species, the development of individuals, and the
expression and regulation of genes exists in these sequences. In the past decade or so
there has been enormous interest in unravelling the mysteries of DNA, and it has become
very important to develop new theoretical methods for DNA sequence analysis. Statistical
analysis of DNA sequences [1-9] using modern statistical measures has proven particularly
fruitful. Another approach to studying DNA uses nonlinear scaling methods, such as
fractal dimension and complexity. The correlation properties of coding and noncoding DNA
sequences were first studied by Stanley and coworkers in their "fractal landscape or DNA
walk" model. In this DNA walk, the walker steps "up" if a pyrimidine (C or T) occurs at
position i along the DNA chain, and steps "down" if a purine (A or G) occurs at position
i. Stanley and coworkers discovered that long-range correlation exists in noncoding DNA
sequences, while coding sequences correspond to a regular random walk. But if one
considers more detail by distinguishing C from T among pyrimidines and A from G among
purines (as in two- or three-dimensional DNA walk models and related maps), then base
correlation is found even in coding regions. However, DNA sequences are more complicated
than these types of analysis can describe. It is therefore crucial to develop new tools
for analysis with a view toward uncovering mechanisms used to code other types of
information.

Since the first complete genome of the free-living bacterium Mycoplasma genitalium was
sequenced in 1995, an ever-growing number of complete genomes has been deposited in
public databases. The availability of complete genomes makes it possible to ask global
questions about these sequences. The avoided and under-represented strings in some
bacterial complete genomes have already been discussed in the literature. A time series
model of CDS in complete genomes has also been proposed in [15].



One can ignore the composition of the four kinds of bases in coding and noncoding
segments and consider only the rough structure of the complete genome or of long DNA
sequences. Provata and Almirantis proposed a fractal Cantor pattern of DNA. They map
coding segments to filled regions and noncoding segments to empty regions of a random
Cantor set and then calculate the fractal dimension of this random fractal set. They
found that the coding/noncoding partition in DNA sequences of lower organisms is
homogeneous-like, while in higher eukaryotes the partition is fractal. This result is
interesting and reasonable, but it seems too coarse to distinguish bacteria, because the
fractal dimensions they reported for bacteria are all the same. The classification and
evolutionary relationships of bacteria are among the most important problems in DNA
research. In this paper, we propose a time series model based on the global structure of
the complete genome, and we find that this model yields more information than the
fractal Cantor pattern. We have obtained some new results on the classification and
evolutionary relationships of bacteria.

A DNA sequence is a sequence over the alphabet {A, C, G, T} representing the four 
bases from which DNA is assembled, namely adenine, cytosine, guanine, and thymine. 
At the structural level, however, the complete genome of an organism is made up of
coding and noncoding segments. Here the length of a coding or noncoding segment means
the number of its bases. We first count the lengths of the coding and noncoding
segments in the complete genome. We can then obtain three kinds of integer sequences in
the following ways.

i) First we order all lengths of coding and noncoding segments according to the order
of coding and noncoding segments in the complete genome, then replace the lengths of
noncoding segments by their negatives, so that we can distinguish the lengths of coding
and noncoding segments. This integer sequence is named the whole length sequence.

ii) We order all lengths of coding segments according to the order of coding segments
in the complete genome. We name this integer sequence the coding length sequence. As
examples, we plot the coding length sequences of three bacterial genomes and of the 4th
chromosome of Saccharomyces cerevisiae (yeast) in Figure 1.

iii) We order all lengths of noncoding segments according to the order of noncoding
segments in the complete genome. This integer sequence is named the noncoding length
sequence.

Figure 1: Length and distribution of coding segments in the complete genome or
chromosome of some organisms.

We can now view these three kinds of integer sequences as time series. We want to 
calculate their correlation dimensions and Hurst exponents. 
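
The three constructions i) to iii) can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the input format (a list of
`(is_coding, length)` pairs in genome order) and the function name `length_sequences`
are hypothetical, and the toy lengths are made up rather than taken from any genome.

```python
def length_sequences(segments):
    """Build the whole, coding and noncoding length sequences.

    `segments` is an assumed input format: (is_coding, length) pairs listed
    in the order in which the segments occur along the genome.
    """
    # i) whole length sequence: noncoding lengths enter with a negative sign
    whole = [length if is_coding else -length for is_coding, length in segments]
    # ii) coding length sequence: coding lengths only, genome order preserved
    coding = [length for is_coding, length in segments if is_coding]
    # iii) noncoding length sequence: noncoding lengths only, genome order preserved
    noncoding = [length for is_coding, length in segments if not is_coding]
    return whole, coding, noncoding


# Toy input with made-up lengths (not real genome data).
toy = [(False, 120), (True, 900), (False, 45), (True, 1500), (False, 300)]
whole, coding, noncoding = length_sequences(toy)
print(whole)      # [-120, 900, -45, 1500, -300]
print(coding)     # [900, 1500]
print(noncoding)  # [120, 45, 300]
```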



2 Correlation dimension and Hurst exponent 

The notion of correlation dimension, introduced by Grassberger and Procaccia, is well
suited to experimental situations in which only a single time series is available. It is
now widely used in many branches of physical science. Consider a sequence of data from
a computer or laboratory experiment:

$$x_1, x_2, x_3, \ldots, x_N, \tag{1}$$

where $N$ is a large enough number. These numbers are usually sampled at an equal time
interval $\Delta t$. We embed the time series into $\mathbb{R}^m$, choose a time delay
$\tau = p\,\Delta t$, and then obtain

$$y_i = (x_i, x_{i+p}, x_{i+2p}, \ldots, x_{i+(m-1)p}), \qquad i = 1, 2, \ldots, N_m, \tag{2}$$

where

$$N_m = N - (m-1)p. \tag{3}$$

In this way we get $N_m$ vectors in the embedding space $\mathbb{R}^m$.
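
The embedding step can be sketched as follows. This is a minimal illustration using
NumPy, and the function name `delay_embed` is an assumption of this sketch. Indices are
0-based in the code, whereas the text counts from 1.

```python
import numpy as np


def delay_embed(x, m, p=1):
    """Return the N_m delay vectors (x_i, x_{i+p}, ..., x_{i+(m-1)p}) as rows."""
    x = np.asarray(x, dtype=float)
    n_vectors = len(x) - (m - 1) * p  # N_m = N - (m - 1) p
    if n_vectors <= 0:
        raise ValueError("series too short for this m and p")
    # Column k holds the series shifted by k * p positions.
    return np.stack([x[k * p : k * p + n_vectors] for k in range(m)], axis=1)


y = delay_embed([1, 2, 3, 4, 5, 6], m=3, p=2)
print(y.shape)     # (2, 3)
print(y.tolist())  # [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]
```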
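For any two vectors $y_i$ and $y_j$, we define the distance as

$$d(y_i, y_j) = \sum_{l=0}^{m-1} |x_{i+lp} - x_{j+lp}|. \tag{4}$$

If the distance is less than a given number $r$, we say that these two vectors are
correlated. The correlation integral is defined as

$$C_m(r) = \frac{1}{N_m^2} \sum_{i,j=1}^{N_m} H(r - d(y_i, y_j)), \tag{5}$$

where $H$ is the Heaviside function

$$H(x) = \begin{cases} 1, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0. \end{cases} \tag{6}$$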

For a proper choice of $m$ and not too large a value of $r$, it has been shown by
Grassberger and Procaccia that the correlation integral $C_m(r)$ behaves like

$$C_m(r) \propto r^{D_2(m)}. \tag{7}$$

Thus one can define the correlation dimension as

$$D_2 = \lim_{m \to \infty} D_2(m) = \lim_{m \to \infty} \lim_{r \to 0} \frac{\ln C_m(r)}{\ln r}. \tag{8}$$

For more details on D 2 , the reader can refer to 



To deal with practical problems, one usually chooses $p = 1$. If we choose a sequence
$\{r_i : 1 \le i \le n\}$ such that $r_1 < r_2 < r_3 < \cdots < r_n$, then a scaling
region can be found in the $\ln r$ versus $\ln C_m(r)$ plane (see the cited reference,
p. 346). The slope of the scaling region is then $D_2(m)$. When $D_2(m)$ no longer
changes as $m$ increases, we take this $D_2(m_0)$ as the estimate of $D_2$. We calculate
the correlation dimensions of the three kinds of length sequences of the complete genome
using the method introduced above. From the $\ln r$ versus $\ln C_m(r)$ plots of these
sequences for different values of the embedding dimension $m$, we find that it is
suitable to choose $m = 7$. For example, we give the $\ln r$ versus $\ln C_m(r)$ plot of
the whole length sequence of A. fulgidus when $m = 6, 7$ (Figure 2). We take the region
from the third point to the 20th point (from left to right) as the scaling region.

Figure 2: Left) $\ln r$ versus $\ln C_m(r)$ plot of the length sequence of coding and
noncoding segments of A. fulgidus when $m = 6, 7$. Right) Estimate of the correlation
dimension (the continuous line).
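
The estimation procedure described above can be sketched as follows. This is a hedged
illustration rather than the code used for the reported results: the function names,
the choice of radii and the synthetic test series are assumptions of this sketch. It
uses the sum-of-absolute-differences distance of Eq. (4), counts all ordered pairs
$(i, j)$ in the correlation integral, and takes the least-squares slope of $\ln C_m(r)$
against $\ln r$ over a chosen range of points.

```python
import numpy as np


def correlation_integral(x, m, radii, p=1):
    """Return C_m(r) for each r in `radii`, using the distance of Eq. (4)."""
    x = np.asarray(x, dtype=float)
    n_vec = len(x) - (m - 1) * p  # N_m of Eq. (3)
    if n_vec <= 0:
        raise ValueError("series too short for this m and p")
    dist = np.zeros((n_vec, n_vec))
    for k in range(m):  # accumulate |x_{i+kp} - x_{j+kp}| over the m coordinates
        col = x[k * p : k * p + n_vec]
        dist += np.abs(col[:, None] - col[None, :])
    # H(r - d) equals 1 when d <= r; all ordered pairs (i, j) are counted.
    return np.array([(dist <= r).sum() / n_vec**2 for r in radii])


def correlation_dimension(x, m, radii, p=1, fit=slice(None)):
    """Least-squares slope of ln C_m(r) against ln r over the points in `fit`."""
    radii = np.asarray(radii, dtype=float)
    c = correlation_integral(x, m, radii, p)
    slope, _intercept = np.polyfit(np.log(radii[fit]), np.log(c[fit]), 1)
    return slope


# Synthetic sanity check (not genome data): uniform noise embedded with m = 1
# should give a slope close to 1.
rng = np.random.default_rng(0)
noise = rng.random(800)
radii = np.logspace(-2, -1, 10)
d2_noise = correlation_dimension(noise, m=1, radii=radii)
print(round(d2_noise, 2))
```

For the setting described in the text, one would pass `m=7` and restrict the fit to the
third through 20th radii with `fit=slice(2, 20)`, given a list of at least 20 radii. The
radii themselves are not specified in the text and would have to be chosen to suit the
data.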

Hurst invented the now famous statistical method of rescaled range analysis (R/S
analysis) to study long-range dependence in time series. Later, B. B. Mandelbrot and
J. Feder brought R/S analysis into fractal analysis. For any time series
$x = \{x_k\}_{k=1}^{N}$ and any $2 \le n \le N$, one can define

$$\langle x \rangle_n = \frac{1}{n} \sum_{i=1}^{n} x_i, \tag{9}$$

$$X(i, n) = \sum_{u=1}^{i} \left[ x_u - \langle x \rangle_n \right], \tag{10}$$

$$R(n) = \max_{1 \le i \le n} X(i, n) - \min_{1 \le i \le n} X(i, n), \tag{11}$$

$$S(n) = \left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \langle x \rangle_n)^2 \right]^{1/2}. \tag{12}$$

Hurst found that

$$R(n)/S(n) \sim \left( \frac{n}{2} \right)^{H}. \tag{13}$$

$H$ is called the Hurst exponent.

As $n$ changes from 2 to $N$, we obtain $N - 1$ points in the $\ln(n)$ versus
$\ln(R(n)/S(n))$ plane. We can then calculate the Hurst exponent $H$ of the length
sequence of an organism using a least-squares linear fit. As an example, we plot the
R/S analysis of the whole length sequence of A. fulgidus in Figure 3.
Figure 3: Calculation of Hurst exponent.
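
The R/S procedure of Eqs. (9) to (13) can be sketched as follows. This is an
illustration under stated assumptions, not the code behind the reported values: the
function names are hypothetical, prefixes with zero variance are skipped (a choice made
here, since $S(n) = 0$ leaves the ratio undefined), and the test series is synthetic
white noise.

```python
import numpy as np


def rescaled_range(x, n):
    """R(n)/S(n) of Eqs. (9)-(12), computed on the first n values of x."""
    head = np.asarray(x[:n], dtype=float)
    dev = head - head.mean()           # x_u - <x>_n
    cum = np.cumsum(dev)               # X(i, n) for i = 1..n
    r = cum.max() - cum.min()          # R(n)
    s = np.sqrt(np.mean(dev ** 2))     # S(n)
    return r / s if s > 0 else np.nan  # undefined for a constant prefix


def hurst_exponent(x):
    """Least-squares slope of ln(R(n)/S(n)) against ln(n) for n = 2..N."""
    ns = np.arange(2, len(x) + 1)
    rs = np.array([rescaled_range(x, n) for n in ns])
    ok = np.isfinite(rs) & (rs > 0)    # drop undefined or zero ratios before taking logs
    slope, _intercept = np.polyfit(np.log(ns[ok]), np.log(rs[ok]), 1)
    return slope


print(rescaled_range([1, 2, 3, 4], 4))  # 2 / sqrt(1.25), about 1.789

# Synthetic check (not genome data): white noise, for which H is expected near 1/2.
rng = np.random.default_rng(0)
h_noise = hurst_exponent(rng.standard_normal(1000))
print(round(h_noise, 2))
```

Each value of $n$ requires a pass over the first $n$ terms, so the cost grows
quadratically with $N$. That is manageable for sequences of a few thousand segment
lengths.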



The Hurst exponent is usually used as a measure of complexity. The trajectory of
the record is a curve with fractal dimension D = 2 - H ([25], p. 149). Hence a smaller H
means a more complex system. When applied to fractional Brownian motion, the system
is said to be persistent if H > 1/2, which means that if the motion is along one
direction for a given time period t, then in the succeeding period it is more likely to
follow the same direction. For H < 1/2, the opposite holds, that is, the system is
antipersistent. When H = 1/2, the system is a Brownian motion and is random.



3 Data and results

More than 21 bacterial complete genomes are now available in public databases.
There are five Archaebacteria: Archaeoglobus fulgidus, Pyrococcus abyssi,
Methanococcus jannaschii, Aeropyrum pernix and Methanobacterium thermoautotrophicum;
and four Gram-positive Eubacteria: Mycobacterium tuberculosis, Mycoplasma pneumoniae,
Mycoplasma genitalium, and Bacillus subtilis. The others are Gram-negative Eubacteria.
These consist of two hyperthermophilic bacteria: Aquifex aeolicus and Thermotoga
maritima; six Proteobacteria: Rhizobium sp. NGR234, Escherichia coli, Haemophilus
influenzae, Helicobacter pylori J99, Helicobacter pylori 26695 and Rickettsia
prowazekii; two Chlamydia: Chlamydia trachomatis and Chlamydia pneumoniae; and two
Spirochetes: Borrelia burgdorferi and Treponema pallidum.

We calculate the correlation dimensions and Hurst exponents of the three kinds of length
sequences of the above 21 bacteria. The estimated results are given in Table 1 (where
$D_{2,\mathrm{whole}}$, $D_{2,\mathrm{cod}}$ and $D_{2,\mathrm{noncod}}$ denote the
correlation dimensions of the whole, coding and noncoding length sequences, listed from
top to bottom in increasing order of $D_{2,\mathrm{cod}}$) and Table 2 (where
$H_{\mathrm{whole}}$, $H_{\mathrm{cod}}$ and $H_{\mathrm{noncod}}$ denote the Hurst
exponents of the whole, coding and noncoding length sequences, listed from top to bottom
in increasing order of $H_{\mathrm{cod}}$).

Table 1: $D_{2,\mathrm{whole}}$, $D_{2,\mathrm{cod}}$ and $D_{2,\mathrm{noncod}}$ of 21 bacteria.



4 Discussion and conclusions 

Although the existence of the archaebacterial urkingdom has been accepted by many
biologists, the classification of bacteria is still a matter of controversy. The
evolutionary relationship of the three primary kingdoms (i.e. archaebacteria,
eubacteria and eukaryotes) is another crucial problem that remains unresolved.

From Table 1, we can roughly divide bacteria into two classes, one class with
$D_{2,\mathrm{cod}}$ less than 2.40, and the other with $D_{2,\mathrm{cod}}$ greater
than 2.40. We observe that the classification of bacteria using $D_{2,\mathrm{cod}}$
almost coincides with the traditional classification of bacteria. All Archaebacteria
belong to the same class. All Proteobacteria belong to the same class except Rhizobium
sp. NGR234; in particular, the closest Proteobacteria, Helicobacter pylori 26695 and
Helicobacter pylori J99, group with each other. The two Spirochetes group with each
other, as do the two Chlamydia. The Gram-positive bacteria are divided into two
sub-categories: Mycoplasma genitalium and Mycoplasma pneumoniae belong to one class and
group with each other, while Mycobacterium tuberculosis and Bacillus subtilis belong to
the other class and almost group with each other.

Table 2: $H_{\mathrm{whole}}$, $H_{\mathrm{cod}}$ and $H_{\mathrm{noncod}}$ of 21 bacteria.

If one classifies bacteria using $D_{2,\mathrm{whole}}$, with $D_{2,\mathrm{whole}}$
less than 2.55 in one subclass and larger than 2.55 in the other, almost the same
results hold as those using $D_{2,\mathrm{cod}}$. But when one classifies bacteria using
$D_{2,\mathrm{noncod}}$, the results are quite different. This is reasonable because the
coding segments occupy the main part of the DNA chain of bacteria.

A surprising feature shown in Table 1 is that the hyperthermophilic bacteria (Aquifex
aeolicus and Thermotoga maritima) are closely linked with the Archaebacteria if we
consider only the length sequences of noncoding segments. But when we consider the
length sequences of coding segments, they are closely linked with the Eubacteria. We
notice that Aquifex, like most Archaebacteria, is hyperthermophilic. Hence it seems that
hyperthermophilicity is possibly controlled by the noncoding part of the genome,
contrary to the traditional view resulting from classification based on the coding part
of the genome. It has previously been shown that Aquifex has a close relationship with
Archaebacteria from the gene comparison of an enzyme needed for the synthesis of the
amino acid tryptophan. Such strong correlation on the level of the complete genome
between Aquifex and Archaebacteria is not easily accounted for by lateral transfer and
other accidental events. Our result is based on a different level of the genome from
that used by the authors of [28].



From Table 1, one can also see that the $D_{2,\mathrm{cod}}$ values are almost always
larger than the $D_{2,\mathrm{noncod}}$ values. Hence the coding length sequences are
more complex than the noncoding length sequences.

From Table 2, we can also roughly divide bacteria into two classes, one class with
$H_{\mathrm{cod}}$ less than 0.60, and the other with $H_{\mathrm{cod}}$ greater than
0.60. All Archaebacteria belong to the same class except Pyrococcus abyssi. All
Gram-positive Eubacteria belong to the same class except Bacillus subtilis. All
Proteobacteria belong to the same class except Haemophilus influenzae. The two
Spirochetes group with each other, and the two Chlamydia almost group with each other.

We also find that the $H_{\mathrm{noncod}}$ values of all Archaebacteria except
Pyrococcus abyssi, of the two hyperthermophilic bacteria, and of Mycoplasma genitalium
and E. coli are less than 1/2, while those of the other bacteria are greater than 1/2.
Hence the hyperthermophilic bacteria share some common information with Archaebacteria
in their noncoding segments.

We calculate $D_{2,\mathrm{whole}}$, $D_{2,\mathrm{cod}}$, $D_{2,\mathrm{noncod}}$,
$H_{\mathrm{whole}}$, $H_{\mathrm{cod}}$ and $H_{\mathrm{noncod}}$ of the 4th chromosome
of Saccharomyces cerevisiae (yeast). They are 2.5603, 2.1064, 2.5013, 0.5517, 0.6255 and
0.5482, respectively. From Tables 1 and 2, if we consider $D_{2,\mathrm{whole}}$,
$H_{\mathrm{whole}}$ and $H_{\mathrm{cod}}$, we can see that Archaebacteria and
Chlamydia are linked more closely with yeast, which is a eukaryote, than are the other
categories of bacteria. There are several reports (such as [29]) that, in some RNA and
protein species, archaebacteria are much more similar in sequence to eukaryotes than to
eubacteria. Our present result supports this point of view.
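
As a small illustration of the threshold rules used in this section, the sketch below
applies the cut-off values quoted in the text (2.40 for the coding correlation
dimension, 2.55 for the whole-sequence correlation dimension, 0.60 for the coding Hurst
exponent, and 1/2 for the noncoding Hurst exponent) to the six values reported above for
yeast chromosome 4. The dictionary keys and the function name `classify` are
hypothetical. The sketch only shows which side of each cut-off a value falls on, and it
does not reproduce the comparison with individual bacteria, which relies on the
per-species values in Tables 1 and 2.

```python
# Values reported in the text for the 4th chromosome of Saccharomyces cerevisiae.
YEAST_CHR4 = {
    "D2_whole": 2.5603, "D2_cod": 2.1064, "D2_noncod": 2.5013,
    "H_whole": 0.5517, "H_cod": 0.6255, "H_noncod": 0.5482,
}

# Cut-off values quoted in the discussion (key names are illustrative).
THRESHOLDS = {"D2_cod": 2.40, "D2_whole": 2.55, "H_cod": 0.60, "H_noncod": 0.5}


def classify(values, thresholds=THRESHOLDS):
    """Report, for each thresholded quantity, whether the value lies below or above it."""
    return {key: ("below" if values[key] < cut else "above")
            for key, cut in thresholds.items()}


print(classify(YEAST_CHR4))
# {'D2_cod': 'below', 'D2_whole': 'above', 'H_cod': 'above', 'H_noncod': 'above'}
```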

In earlier work, we found that the Hurst exponent is a good tool to distinguish
different functional regions. But now, considering the more global structure of the
genome, we find the correlation dimension to be a better exponent than the Hurst
exponent for the classification of bacteria at this level.